SageMaker Debugger


Improving Mask RCNN Convergence with PyTorch Lightning and SageMaker Debugger

#artificialintelligence

MLPerf training times represent the state of the art in machine learning performance: AI industry leaders publish their best training times for a set of common machine learning models. But optimizing for training speed means these models are often complex and difficult to move to practical applications. Last year, we published SageMakerCV, a collection of computer vision models based on MLPerf, but with added flexibility and optimization for use on Amazon SageMaker. The recently published MLPerf 2.0 adds a series of new optimizations. In this blog, we discuss those optimizations, and how we can use PyTorch Lightning and SageMaker Debugger to further improve training performance and flexibility.


The science behind SageMaker's cost-saving Debugger

#artificialintelligence

A machine learning training job can seem to be running like a charm, while it's really suffering from problems such as overfitting, exploding model parameters, and vanishing gradients, which can compromise model performance. Historically, spotting such problems during training has required the persistent attention of a machine learning expert. The Amazon SageMaker team has developed a new tool, SageMaker Debugger, that automates this problem-spotting process, saving customers time and money. For example, by using Debugger, one SageMaker customer reduced model size by 45% and the number of GPU operations by 33%, while improving accuracy. Next week, at the Conference on Machine Learning and Systems (MLSys), we will present a paper that describes the technology behind SageMaker Debugger.
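The problems the excerpt mentions — exploding parameters and vanishing gradients — are exactly the kind of thing a rule can flag automatically from saved training data. The following is a minimal illustrative sketch of such a rule in plain Python, not SageMaker Debugger's actual API; the thresholds are assumptions for the example, not Debugger's defaults.

```python
# Sketch of a Debugger-style rule: flag vanishing or exploding gradients
# from per-step gradient norms. Thresholds here are illustrative only.

def check_gradients(grad_norms, vanish_threshold=1e-7, explode_threshold=1e3):
    """Return (step, issue) pairs for suspicious gradient norms."""
    issues = []
    for step, norm in enumerate(grad_norms):
        if norm < vanish_threshold:
            issues.append((step, "vanishing_gradient"))
        elif norm > explode_threshold:
            issues.append((step, "exploding_gradient"))
    return issues

# Example: a run whose gradients collapse toward zero in later steps
norms = [0.9, 0.5, 0.1, 1e-4, 1e-8, 1e-9]
print(check_gradients(norms))  # flags steps 4 and 5 as vanishing
```

In the real service, rules like this run as separate processing jobs against tensors the training job emits, so spotting a problem no longer requires a human watching the metrics.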


Utilizing XGBoost training reports to improve your models

#artificialintelligence

In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature.
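The two mechanics described above — capturing state at specified intervals, then evaluating rules over the captured values — can be sketched in a few lines. This is a hypothetical toy in plain Python, not the smdebug library's API; the save interval, loss histories, and the overfitting rule's `patience` parameter are all assumptions for illustration.

```python
# Toy version of interval-based capture plus a user-defined rule,
# in the spirit of SageMaker Debugger's rule engine.

def capture(history, save_interval=2):
    """Keep only values saved every `save_interval` steps."""
    return {step: v for step, v in enumerate(history) if step % save_interval == 0}

def overfit_rule(train_loss, val_loss, patience=2):
    """Fire at the captured step where validation loss has risen for
    `patience` consecutive captures while training loss kept falling."""
    steps = sorted(train_loss)
    rising = 0
    for prev, cur in zip(steps, steps[1:]):
        if val_loss[cur] > val_loss[prev] and train_loss[cur] < train_loss[prev]:
            rising += 1
            if rising >= patience:
                return cur
        else:
            rising = 0
    return None

train_hist = [1.0, 0.8, 0.6, 0.5, 0.4, 0.35, 0.3, 0.28]
val_hist = [1.1, 0.9, 0.7, 0.68, 0.72, 0.8, 0.9, 1.0]
train, val = capture(train_hist), capture(val_hist)
print(overfit_rule(train, val))  # → 6: the captured step where the rule fires
```

The same captured data can also be analyzed after the job finishes, which is what the XGBoost training report builds on.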


Analyzing open-source ML pipeline models in real time using Amazon SageMaker Debugger

#artificialintelligence

Open-source workflow managers are popular because they make it easy to orchestrate machine learning (ML) jobs for production. Taking models into production following a GitOps pattern is best managed by a container-friendly workflow manager, a practice known as MLOps. Kubeflow Pipelines (KFP) is one of the Kubernetes-based workflow managers used today. However, it doesn't provide all the functionality you need for a best-in-class data science and ML engineering experience. A common issue when developing ML models is having access to tensor-level metadata describing how the job is performing.
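"Tensor-level metadata" typically means summary statistics emitted per tensor per save step, rather than the full tensors. As a hedged illustration of the idea (not the smdebug hook's actual output format — the field names here are assumptions), a hook might emit records like this:

```python
# Hypothetical shape of per-tensor summary statistics a debugging hook
# could emit at each save step, for real-time pipeline monitoring.

def tensor_summary(name, step, values):
    """Reduce a tensor (here a flat list of floats) to summary stats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {
        "name": name,
        "step": step,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": var ** 0.5,
    }

record = tensor_summary("gradients/conv1", step=0, values=[1.0, 2.0, 3.0, 4.0])
print(record["mean"], record["min"], record["max"])  # 2.5 1.0 4.0
```

Streaming records like these out of a KFP step is what lets an external analyzer watch a pipeline job in real time.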


Recap of AWS re:Invent 2020

#artificialintelligence

This year the annual re:Invent conference organized by AWS was virtual, free, and three weeks long. During multiple keynotes and sessions, AWS announced new features, improvements, and cloud services. Below is a review of the main announcements affecting compute, database, storage, networking, machine learning, and development. On the very first day of the conference, Amazon announced EC2 Mac instances for macOS, adding a new operating system to EC2 after many years. This is mainly targeted at workloads that only run on macOS, such as building and testing applications for iOS, macOS, tvOS, and Safari.


New – Profile Your Machine Learning Training Jobs With Amazon SageMaker Debugger

#artificialintelligence

Today, I'm extremely happy to announce that Amazon SageMaker Debugger can now profile machine learning models, making it much easier to identify and fix training issues caused by hardware resource usage. Despite its impressive performance on a wide range of business problems, machine learning (ML) remains a bit of a mysterious topic. Getting things right is an alchemy of science, craftsmanship (some would say wizardry), and sometimes luck. In particular, model training is a complex process whose outcome depends on the quality of your dataset, your algorithm, its parameters, and the infrastructure you're training on. As ML models become ever larger and more complex (I'm looking at you, deep learning), one growing issue is the amount of infrastructure required to train them.
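The profiling feature described here works by sampling system metrics (CPU, GPU, I/O) during training and flagging patterns such as sustained GPU underutilization, which often points to a data-loading bottleneck. Below is a minimal plain-Python sketch of that kind of analysis, under assumed inputs and an illustrative threshold; it is not the Debugger profiler's actual implementation.

```python
# Sketch: find windows where sampled GPU utilization (%) stays low,
# the signature of a training job starved by its input pipeline.

def low_utilization_windows(samples, threshold=30, min_len=3):
    """Return (start, end) index ranges where utilization stays below
    `threshold` for at least `min_len` consecutive samples."""
    windows, start = [], None
    for i, u in enumerate(samples):
        if u < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                windows.append((start, i - 1))
            start = None
    if start is not None and len(samples) - start >= min_len:
        windows.append((start, len(samples) - 1))
    return windows

gpu = [85, 90, 10, 5, 8, 88, 92, 15, 12, 9, 11]
print(low_utilization_windows(gpu))  # [(2, 4), (7, 10)]
```

In the managed service, reports like this are generated automatically alongside the training job, so the stalls are visible without instrumenting the training script by hand.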